2.5 The Vector-Centric View of Backpropagation
In this section, we will discuss the vector-centric view of backpropagation. First, we recap the vector architecture of neural networks discussed in Chapter 1. The corresponding weight matrices between successive layers are denoted by $W_1 \ldots W_{k+1}$. Let $\bar{x}$ be the $d$-dimensional column vector corresponding to the input, $\bar{h}_1 \ldots \bar{h}_k$ be the column vectors corresponding to the hidden layers, and $\bar{o}$ be the $m$-dimensional column vector corresponding to the output. Then, we have the following recurrence condition for multi-layer networks:

$$\bar{h}_1 = \Phi(W_1 \bar{x})$$
$$\bar{h}_{p+1} = \Phi(W_{p+1} \bar{h}_p) \qquad \forall p \in \{1, \ldots, k-1\}$$
$$\bar{o} = \Phi(W_{k+1} \bar{h}_k)$$
The function $\Phi(\cdot)$ is applied in an element-wise fashion. A key point here is that each of the above functions is a vector-to-vector function obtained by composing a linear layer with a nonlinear activation layer. Furthermore, the output $\bar{o}$ is an even more complex composition function of the earlier layers.
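As a concrete illustration of this recurrence (not part of the original text), the following NumPy sketch runs the forward pass as a sequence of matrix-vector products followed by an element-wise activation. The choice of tanh for $\Phi$ and all variable names are assumptions made for the example.

```python
import numpy as np

def forward(x, weight_matrices, phi=np.tanh):
    """Vector-centric forward pass: h_{p+1} = Phi(W_{p+1} h_p), with h_0 = x.

    `weight_matrices` is the list [W_1, ..., W_{k+1}]; the activation `phi`
    is applied element-wise after each linear transformation.
    """
    h = x
    for W in weight_matrices:
        h = phi(W @ h)  # linear layer followed by element-wise activation
    return h

# Example: a 5-dimensional input, a 3-unit hidden layer, a 2-dimensional output.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 5))  # maps R^5 -> R^3
W2 = rng.standard_normal((2, 3))  # maps R^3 -> R^2
x = rng.standard_normal(5)
print(forward(x, [W1, W2]))       # a 2-dimensional output vector
```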
A comparison of the scalar architecture with the vector architecture is provided in the figure. It is noteworthy that the connection matrix between the input layer and the first hidden layer is of size 5×3, since there are 5 inputs and 3 units in the first hidden layer. However, in order to apply the linear transformation to the 5-dimensional column vector $\bar{x}$, the weight matrix $W_1$ must be of size 3×5, so that $W_1\bar{x}$ is a 3-dimensional vector. A key point is that the entire neural network can be expressed as a single path of vector-centric operations, which greatly simplifies the topology of the computational graph. However, all functions in the vector-centric view of neural networks are vector-to-vector functions. Therefore, one needs vector-to-vector derivatives and a corresponding chain rule.
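To make the dimension bookkeeping concrete, here is a tiny sketch (illustrative only, with made-up values): the 5×3 connection matrix described above stores one row per input node and one column per hidden unit, while the matrix $W_1$ that multiplies the 5-dimensional input is its 3×5 transpose.

```python
import numpy as np

rng = np.random.default_rng(1)
connections = rng.standard_normal((5, 3))  # one row per input, one column per hidden unit

W1 = connections.T                 # the linear map applied to the input is 3 x 5
x = rng.standard_normal(5)         # 5-dimensional input column vector
print(W1.shape, (W1 @ x).shape)    # (3, 5) (3,) -- a 3-dimensional hidden layer
```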
2.5.1 A Summary of Derivatives with Respect to Vectors
The backpropagation algorithm requires the computation of node-to-node derivatives and loss-to-node derivatives as intermediate steps. In the vector-centric view, one wants to compute derivatives with respect to entire layers of nodes. To do so, one must use matrix calculus notation, which allows derivatives of scalars and vectors with respect to other vectors.
For example, consider the case where one wishes to compute the derivative of a scalar loss $L$ with respect to a vector layer $\bar{h}$, where $\bar{h} = [h_1 \ldots h_m]^T$ is an $m$-dimensional column vector. The derivative of a scalar with respect to a column vector is another column vector, using the denominator layout convention of matrix calculus. This derivative is denoted by $\frac{\partial L}{\partial \bar{h}}$ and is simply the gradient. This notation is a scalar-to-vector derivative, which always returns a vector. Therefore, we have the following:

$$\frac{\partial L}{\partial \bar{h}} = \nabla_{\bar{h}} L = \left[\frac{\partial L}{\partial h_1} \ldots \frac{\partial L}{\partial h_m}\right]^T$$
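As an illustration (not from the original text), the sketch below estimates such a scalar-to-vector derivative with central finite differences for the loss $L(\bar{h}) = \|\bar{h}\|^2$, whose gradient is $2\bar{h}$. The helper name `numerical_gradient` is made up for this example.

```python
import numpy as np

def numerical_gradient(L, h, eps=1e-6):
    """Finite-difference estimate of dL/dh, one component per entry of h
    (the scalar-to-vector derivative is a vector of the same shape as h)."""
    grad = np.zeros_like(h)
    for i in range(h.size):
        e = np.zeros_like(h)
        e[i] = eps
        grad[i] = (L(h + e) - L(h - e)) / (2 * eps)
    return grad

# Example: L(h) = ||h||^2 has gradient dL/dh = 2h.
h = np.array([1.0, -2.0, 0.5])
L = lambda v: float(v @ v)
print(numerical_gradient(L, h))   # approximately [ 2. -4.  1.]
```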
The matrix calculus notation also allows derivatives of vectors with respect to vectors. For example, the derivative of an $m$-dimensional column vector $\bar{h} = [h_1 \ldots h_m]^T$ with respect to a $d$-dimensional column vector $\bar{x} = [x_1 \ldots x_d]^T$ is a $d \times m$ matrix in the denominator layout convention. The $(i, j)$th entry of this matrix is the derivative of $h_j$ with respect to $x_i$:

$$\left[\frac{\partial \bar{h}}{\partial \bar{x}}\right]_{ij} = \frac{\partial h_j}{\partial x_i}$$

This matrix is closely related to the Jacobian matrix in matrix calculus. The $(i, j)$th element of the Jacobian is always $\frac{\partial h_i}{\partial x_j}$, and therefore it is the transpose of the matrix shown above:

$$\text{Jacobian}(\bar{h}, \bar{x}) = \left[\frac{\partial \bar{h}}{\partial \bar{x}}\right]^T$$

The transposition occurs because of the use of the denominator layout. In fact, the Jacobian matrix is exactly equal to $\frac{\partial \bar{h}}{\partial \bar{x}}$ in the numerator layout convention of matrix calculus. However, we will consistently work with the denominator layout convention in this book.
Two special cases of vector-to-vector derivatives are very useful in neural networks. The first is the linear propagation $\bar{h} = W\bar{x}$, which occurs in the linear layers of neural networks. In such a case, $\frac{\partial \bar{h}}{\partial \bar{x}}$ can be shown to be $W^T$. The second is the case in which an element-wise activation function $\bar{h} = \Phi(\bar{x})$ is applied to the $d$-dimensional input vector $\bar{x}$ to create another $d$-dimensional vector. In such a case, $\frac{\partial \bar{h}}{\partial \bar{x}}$ can be shown to be the $d \times d$ diagonal matrix in which the $i$th diagonal entry is the derivative $\Phi'(x_i)$. Here, $x_i$ is the $i$th component of the vector $\bar{x}$.
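The sketch below (illustrative, assuming tanh as the element-wise activation) numerically checks both special cases in the denominator layout: the derivative of $\bar{h} = W\bar{x}$ is $W^T$, and the derivative of an element-wise activation is the diagonal matrix of component-wise derivatives.

```python
import numpy as np

def jacobian_denominator_layout(f, x, eps=1e-6):
    """Finite-difference estimate of dh/dx in denominator layout:
    entry (i, j) is the derivative of output component j w.r.t. input component i."""
    fx = f(x)
    J = np.zeros((x.size, fx.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        J[i, :] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 5))
x = rng.standard_normal(5)

# Case 1: h = W x  =>  dh/dx = W^T (a 5 x 3 matrix in denominator layout).
print(np.allclose(jacobian_denominator_layout(lambda v: W @ v, x), W.T))

# Case 2: h = Phi(x) applied element-wise (here Phi = tanh, Phi'(x) = 1 - tanh(x)^2)
#         =>  dh/dx = diag(Phi'(x_1), ..., Phi'(x_d)).
print(np.allclose(jacobian_denominator_layout(np.tanh, x),
                  np.diag(1.0 - np.tanh(x) ** 2)))
```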
2.5.2 Vector-Centric Chain Rule
This section presents the vector-centric chain rule, the piece of matrix calculus needed to differentiate the layer-wise compositions that arise in neural networks.
Context:
Neural networks, especially deep ones, are structured as a sequence of layers, and each layer can be viewed as a vector-to-vector function. The final output of a deep network is therefore a composition of these functions. Mathematically, for a neural network with $k$ layers, the output $\bar{o}$ for an input $\bar{x}$ is given as:

$$\bar{o} = F_k(F_{k-1}(\ldots F_1(\bar{x}) \ldots))$$
Theorem 2.5.1 (Vectored Chain Rule):
This theorem outlines a systematic approach to compute the derivative of the overall neural network output (a vector) with respect to its input (another vector). Here's a breakdown:
Composition Function: The function $F(\bar{x})$ represents the outcome when all the neural network functions (layers) are composed. It is expressed as:

$$F(\bar{x}) = F_k(F_{k-1}(\ldots F_1(\bar{x}) \ldots))$$
Function Characteristics: Every function $F_i$ within the composition accepts an $n_i$-dimensional column vector as input and yields an $n_{i+1}$-dimensional column vector as output. The initial input $\bar{x}$ is therefore an $n_1$-dimensional vector, whereas the final output is of dimension $n_{k+1}$.
Notation: The vector output generated by each function $F_i(\cdot)$ is denoted by $\bar{h}_i$.
Vectored Chain Rule Expression: The core of the theorem stipulates that the derivative of $F(\bar{x})$ with respect to $\bar{x}$ can be represented as a product of matrices. Each matrix in the product is the partial derivative of one layer's output with respect to its own input, which is the output of the preceding layer.
The mathematical expression is:

$$\underbrace{\frac{\partial F(\bar{x})}{\partial \bar{x}}}_{n_1 \times n_{k+1}} = \underbrace{\frac{\partial \bar{h}_1}{\partial \bar{x}}}_{n_1 \times n_2} \; \underbrace{\frac{\partial \bar{h}_2}{\partial \bar{h}_1}}_{n_2 \times n_3} \cdots \underbrace{\frac{\partial \bar{h}_k}{\partial \bar{h}_{k-1}}}_{n_k \times n_{k+1}}$$

Each factor is a vector-to-vector derivative in the denominator layout, and the size of each matrix is indicated under it.
Implication:
The concluding mention of size constraints assures that the matrix multiplications are valid. This refers to the basic rule of matrix multiplication: the number of columns in one matrix must match the number of rows in the next. Owing to the inherent structure of neural networks, these constraints are automatically satisfied, which validates computing the derivative as a product of matrices.
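As a closing illustration (not part of the original text), the sketch below applies the vectored chain rule to a small network with layer sizes 5, 4, 3, and 2, assuming each layer computes $F_i(\bar{v}) = \tanh(W_i \bar{v})$. Each per-layer derivative in denominator layout is $W_i^T$ multiplied by the diagonal matrix of activation derivatives (the two special cases above chained together), and the product of these matrices is checked against a finite-difference estimate of the overall derivative.

```python
import numpy as np

def layer(W, v):
    """One layer of the composition: F(v) = Phi(W v), with Phi = tanh."""
    return np.tanh(W @ v)

def layer_jacobian(W, v):
    """d(Phi(W v))/dv in denominator layout: W^T times the diagonal matrix
    of activation derivatives evaluated at the pre-activation a = W v."""
    a = W @ v
    return W.T @ np.diag(1.0 - np.tanh(a) ** 2)   # shape: (n_in, n_out)

rng = np.random.default_rng(3)
sizes = [5, 4, 3, 2]                              # n_1, ..., n_{k+1}
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
x = rng.standard_normal(sizes[0])

# Vectored chain rule: dF(x)/dx = (dh_1/dx)(dh_2/dh_1) ... (dh_k/dh_{k-1}).
J, h = np.eye(sizes[0]), x
for W in Ws:
    J = J @ layer_jacobian(W, h)
    h = layer(W, h)

# Finite-difference check of the same (n_1 x n_{k+1}) derivative.
def forward(v):
    for W in Ws:
        v = layer(W, v)
    return v

eps = 1e-6
J_num = np.zeros((sizes[0], sizes[-1]))
for i in range(sizes[0]):
    e = np.zeros(sizes[0])
    e[i] = eps
    J_num[i, :] = (forward(x + e) - forward(x - e)) / (2 * eps)
print(np.allclose(J, J_num, atol=1e-6))           # True
```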